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Being able to measure each merger's sky location, distance, component masses, and conceivably 
spins, ground-based gravitational-wave detectors will provide a extensive and detailed sample of co- 
alescing compact binaries (CCBs) in the local and, with third-generation detectors, distant universe. 
These measurements will distinguish between competing progenitor formation models. In this pa- 
per we develop practical tools to characterize the amount of experimentally accessible information 
available, to distinguish between two a priori progenitor models. Using a simple time-independent 
model, we demonstrate the information content scales strongly with the number of observations. 
The exact scaling depends on how significantly mass distributions change between similar models. 
We develop phenomenological diagnostics to estimate how many models can be distinguished, using 
first-generation and future instruments. Finally, we emphasize that multi-observable distributions 
can be fully exploited only with very precisely calibrated detectors, search pipelines, parameter 
estimation, and Bayesian model inference. 



I. INTRODUCTION 

Ground-based gravitational-wave interferometric de- 
tectors like advanced LIGO and Virgo are expected to 
detect at least a few coalescing compact binaries (CCBs) 
per year, based both on semi-empirical extrapolations of 
Milky Way binary pulsar statistics [TH3] and on theoret- 
ical predictions both of isolated binary |4-|)j and clus- 
tered evolution [T0HT3] . Each detected waveform should 
reveal the sky location, distance, component masses, and 
conceivably even component spins. Corrected for the 
known biases of gravitational- wave searches, the observed 
multi-property distribution can be compared statistically 
against any proposed model for CCBs. In other words, 
the set of binary mergers will help us choose between 
competing models for binary compact formation. 

Comparison of models against data is well-explored in 
general [21 [TS] , in astronomy [TB] , in stellar and binary 
evolution [TJ [3J rT7r[2"D] , and in gravitational wave astro- 
physics 21 23J. In brief, parameter and estimation and 
hypothesis testing relies on Bayes' theorem, where a par- 
ticular "data" (here, the set of all signals) is compared 
with a (possibly continuous) space of models: 

where p{d\H) is the marginal probability for the data 
given ri; p(H\d) is the posterior probability given the 
data is observed; p("H) are our prior probabilities for the 
data; and p(d) is self-consistently set to account for all 
models under consideration: 

p(d) = J dnp{d\H)p{n) (2) 

Models with the highest posterior probability are favored; 
relative probabilities give "odds ratios." The above ex- 
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pressions are often expressed with coordinates for the 
model space, denoted henceforth by A. 

While generic and effective, Bayes' theorem's utility 
relies on our ability to efficiently examine all reasonable 
scenarios and, for each, to accurately estimate posterior 
probabilities p(d\H). All state-of-the-art detailed source 
models for binary compact object formation are compu- 
tationally expensive [24j [25]. Except for a handful of 
efforts [SJ [HI HZ] , few large-scale parameter surveys have 
been attempted. Moreover, these Monte Carlo studies 
estimate posterior probabilities p{d\H) with limited accu- 
racy. Therefore, though computation-intensive (Markov- 
chain Monte Carlo) methods have been usefully applied 
to estimate p(rl\d) for similarly complicated distribu- 
tions, such as for the parameters of a single binary merger 
(2BH2E]> a complete blind survey of the binary evolution 
model space remains computationally unfeasable. 

Bayes theorem also does not interpret the information 
it provides: it only provides posteriors. One way of in- 
terpreting that information arises naturally, if the prob- 
lem is nearly solved. Eventually, many observations may 
tightly constrain the parameters A of some hypothetical 
formation scenario to a small coordinate region surround- 
ing an optimal model A*. In this limit, the posterior from 
a set of observations can be well-approximated by a nar- 
rowly peaked gaussian: 

r afc (A - A*) Q (A - A*)b 
p(X\d) oc exp - (3) 

for some matrix T a b. In this well-understood limit, each 
subsequent further observation both refines the optimal 
parameters and narrows the likelihood (i.e., increases 
r). The impact of any one proposed measurement (or 
model variation) can be computed perturbatively. While 
asymptotically useful, at this stage we have neither an ad- 
equately accurate model for binary evolution nor a good 
handle on what model parameters nature prefers. Clearly 
more robust diagnostics are required. 

Realistically, many models predict a similar rate and 
mass distribution as any we could conceivably recon- 
struct from the first gravitational wave detections. To 
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best exploit the available information gravitational wave 
signals provide, we will use both the number and proper- 
ties of each gravitational wave signal. But how? Equally 
realistically, the number and properties predicted by 
neighboring distributions depends on the reference point 
where we identify neighbors: the answer we seek depends 
on what nature provides. How can we assess how much 
and what information each detection provides? How can 
we characterize how many models predict similar binary 
parameter distributions? Given a proposed model, how 
can we devise simple, phenomenological diagnostics that 
characterize critical differences from neighboring models? 

In this paper we describe a general yet practi- 
cal method characterize model differences, using a 
coordinate-free "generalized distance" on the space of 
distributions. Our discussion is broadly applicable to 
experiments where on the one hand experiments mea- 
sure several properties of each event but on the other 
hand theory cannot produce comparable multivariate dis- 
tributions without significant computational cost. Our 
method is statistically well-posed; applies equally well 
to coarse or fine sampling of the model space; can em- 
ploy as many or few experimental measurables as desired; 
can be easily modified to account for experimental er- 
ror; and asymptotes to a maximally informative Fisher 
matrix analysis in the limit of small statistical and sys- 
tematic errors. To illustrate this method, we employ a 
concrete toy model, with parameters qualitatively suit- 
able for gravitational wave detection of a population of 
merging compact binaries. In Section [TT] we review the 
theory underlying our diagnostic, the Kullback-Leibler 
(KL) divergence. We explain how, given a model, we 
can sort all other models by their "proximity" to it. We 
describe the formal relationships between our diagnos- 
tic, the log-likelihood, and the Fisher matrix. Following 
other recent studies, we define an effective local dimen- 
sion to characterize the complexity of the local model 



space. In Section III we explore how the region of mod- 
els consistent with observations decreases in size as the 
number of observations increases. As an example, in Sec- 



tion IV we apply this method to a physically-motivated 



toy model for future gravitational wave observations. 

For simplicity, in this work we adopt a highly idealized 
model for gravitational wave detection, where our un- 
changing detector has perfect success at identifying any 
binary inside a sphere of radius D max and complete fail- 
ure outside that distance. We likewise assume mergers 
occur uniformly in space and time. More realistic mod- 
els of time-dependent star formation and cosmologically 
significant reach will be addressed future publications. 



Context and related work 



populations may be selected by gravitational waves, no- 
tably binary black holes, qualitatively similar methods 
can be applied to weakly constrain binary evolution us- 
ing the first few detections |3T]. Previous studies have 
also realized that the greater information available per 
detection allows even stronger constraints |31j . 

Roughly speaking, the implied parameter distribu- 
tions can be reconstructed nonparameterically, using the 
known statistics of the detection process. For example, 
the binary pulsar merger rate is reconstructed with Pois- 
son statistics [U [3] . The binary parameter distribution 
can be similarly reconstructed, using either an ad-hoc or 
physically motivated kernel to describe each point's am- 
biguity function [32]. 1 We will not address the problem 
of distribution estimation; we take the recovered distri- 
butions and event rates as input. 

Bayesian model selection theory has been extensively 
applied in gravitational wave physics, though rarely to 
differentiate between progenitor population models. In 
the most comparable model, model selection was em- 
ployed to assess how well a discrete family of models for 
supermassive black hole growth could be distinguished 
via their gravitational wave signal, as seen by pulsar 
timing arrays |33j . In the context of gravitational wave 
astronomy, model selection has usually been applied to 
explore the significance of proposed violations of Gen- 
eral Relativity [2"Tl - f2"3"] . Most directly comparable is a 
prototype study by Mandel [32], which investigated how 
gravitational wave events provide increasing evidence for 
phenomenological mass distribution parameters. In as- 
tronomy, Bayesian model selection is applied far more 
extensively. In one pertinent example, the known masses 
of black holes and neutron stars can be fit to different 
models QS [37]. 

Other simpler methods have also been provided to dif- 
ferentiate between mass distributions and to determine 
which mass distributions best fit the data. For example, 
a simple one-dimensional Kolmogorov-Smirnov (KS) test 
can be applied to compare observed binary (chirp) masses 
to each model's predicted distribution [HO [2D]- Ad- 
ditionally, gravitational wave observations provide only 
some information. Stellar binary populations and pulsar 
recoil velocities also weakly constrain binary evolution 
[T71 13"5] . We will not address other comparison diagnos- 
tics or multi-observable statistics in this paper. 

In principle, the gravitational wave signal encodes all 
multipole moments of the radiating spacctimc. Nonethe- 
less, via perturbation theory and separation of timescales 
|39H41j . the imprint of gravitational waves on ground- 
based detectors can often be well- approximated by much 
simpler signals. Equivalently, the true high-dimensional 
signal manifold (a submanifold of L 2 : all possible time- 



Though measured through other means, binary com- 
pact object merger rates are already used to constrain 
progenitor models, both in the Milky Way [3[ [T7] and 
throughout the universe [S] [U [35J 130] • Though different 



In fact, parameter estimation methods applied to each detection 
provide a very detailed model for the ambiguity function associ- 
ated with each event. 
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series) is very close to a much lower-dimensional subman- 
ifold. For weak signals, data analysis only requires the 
low-dimensional submanifold. For strong signals, allow- 
ing fine discrimination between signals, model estimation 
can require a higher effective dimension. Both the rel- 
evant effective dimension |42j to a particular amplitude 
and the change in effective dimension with amplitude [43] 
enter directly into gravitational wave data analysis and 
parameter estimation. 



II. DIAGNOSING DIFFERENCES 



In this section we introduce a surprisingly simple but 
robust diagnostic that differentiates between predictions 
for the number and nature of detections. Our suggestions 
are motivated by gravitational wave data analysis, where 
severe computational burdens 2 limit the ability to gen- 
erate a large family of reliable predictions. Rather than 
a continuous model space that can be inhnitcsimally re- 
fined without effort, the model space we must work with 
is invariably discrete and potentially poorly sampled. We 
need tools that distinguish between models; that identify 
potential undersampling of the model space; that esti- 
mate the number and nature of pertinent degrees of free- 
dom; and that do so in a well-understood fashion that 
nonetheless can maximally exploit all available informa- 
tion. Moreover, these methods should ideally work with- 
out coordinates, owing to the extremely high dimension 
of the model space; difficulty in a priori identifying natu- 
ral variations; and computational intractability thorough 
sampling along any single dimension. 

Our diagnostic is surprisingly simple: the log likelihood 
difference, averaged over all data realizations that one of 
the two implies. This difference factors into two contribu- 
tions, measuring the difference in (a) the expected num- 
ber and (b) the "shape" (i.e., parameter distribution). 
Moreover, this diagnostic has close, provable connections 
to many statistical tools familiar from gravitational wave 
data analysis, notably the likelihood and Fisher matrix. 
This diagnostic nonetheless allows us to identify "simi- 
lar" systems blindly, even when the "similarity" region 
extends over a large fraction of model space. In turn, 
the set of "similar" systems naturally defines character- 
istic shape variations, letting us determine the number 
and nature of pertinent degrees of freedom nonparamet- 
rically. 



2 Severe computational burdens are associated both in generating 
astrophysical predictions and in calculating the selection biases 
that connect astrophysics to parameter distributions for the ob- 
served sample. 



A. KL divergence denned 

For our purposes, a binary evolution model makes two 
predictions: first, /x, the average number of events our de- 
tector should identify; and second p{x)dx, the probability 
distribution of different binary parameters among the set 
of detected events (e.g., x consists of component masses, 
spins, location, and orientation). 3 To be concrete, as each 
detection occurs independently of all others, the proba- 
bility that our detector will identify binaries with each 
Xk between Xk , Xk + dxk is 

P(xi . . . x n )d n x = d n xp(n\n)Y[p(x k ) (4) 

k 

p(n|/x) = ^e"" (5) 

Conversely, given a model X = (/J,,p) and a set of events 
d = (xi . . . x n ), we can define a likelihood estimate L 

L[X) = L(X\d) =]nP(x 1 ...x n ) (6) 

where for shorthand we omit explicit dependence on the 
data realization d. Modulo priors and model dimension 
penalties, models with higher peak L are more plau- 
sible estimates for the generating process for x\ . . . Xk 
than models with lower L. Suppose each measurement 
is independently drawn instead from a fiducial model 
X* = (/i* , p* ) . Averaging over all measurements implies 

(L{X)) x = (InpH/i)) + (n) J d^—p^x) In q(^) 

If the fiducial and test models X, X* are equal, the av- 
erage is the sum of (a) the entropy of the poisson 

distribution plus (b) (n) = /i* times the entropy of the 
parameter distribution p. In the more general case where 

X 7^ A*, the mean log likelihood (^Lj will be smaller 

than this bound. We characterize the decrease in ex- 
pected log likelihood with the KL divergence. 

For a general pair of probability distributions 
p(x),q(x), the entropy H p and KL divergence Dkl(p\i) 
are defined by [13 IS] 

H p = — J dxplnp (8) 

DKh(p\q) = / dxplnp/q (9) 



In practice, detector-dependent selection biases strongly modify 
the underlying parameter distribution. For example, for grav- 
itational wave data analysis, the detection range scales as the 
(chirp) mass to the 5/6 power: high mass sources are vastly eas- 
ier to find, though intrinsically rare. For simplicity, we assume 
that both the detector and astrophysics are well-known and thus 
that p(x) is completely determined. 
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Roughly, the KL divergence characterizes the informa- 
tion gain the data must provide to go from a prior p to a 
posterior q. The KL divergence is non-negative definite 
with D = if and only if p = q. The KL divergence is 
not symmetric. Substituting into Eq. (|7|), the 



L{X)) = -[D K Mn) + H Mt ] 



x 



-H*[ d kl(p*\p) + H p ,} (10) 



where £)(/i*|/z) is shorthand for the KL divergence be- 
tween two poisson distributions: 

Dkl{^*\v) = y^pM/j*)[/i - m* +nlmju*//i)] 

= /i* ln(/x*//z) (11) 
iJ^ = — ^ ] p(n|/i)[— jit + nlrx/i — Inn!] 



1 



In 27re/i 1 



(12) 



In particular, given a fixed reference model with parame- 
ters A*, the expected difference in log likelihood between 
two candidate models Ai, A2 can be expressed as a sum 
of two contributions: 
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L(X(M))\ -(L(X(\ 2 ))) ) 
\ 1 * \ / * 

= -«lnp(n|/ii)> - (lnp(n|/x 2 ))) (13) 

= [-Da-l(a**|a*i) - £>a'l(m*Im2)] 

+M*[£»ifi,(p*|gi) - D KL (p*\q 2 )] (14) 



The KL divergence therefore provides a simple, in- 
variant diagnostic, quantifying differences between mod- 
els. Further, the differences it identifies are statistically 
meaningful, connected to differences in (expected) log 
likelihood. Finally, the differences factor: the two terms 
tell us how to weight models' differences, on the one hand 
in rate (the mean number of detections) and on the other 
hand in their predicted parameter distributions. 



B. Fisher matrix and local dimensionality 

One way to discriminate between models relies on max- 
imum likelihood. Observations of viable models have 
L increase (with decreasing relative variance) as more 
observations Xk accumulate. For any given data real- 
ization, statistical fluctuations insure that many models 
with only marginally smaller (L) cannot be reasonably 
distinguished. We therefore want to know how many 
models are "nearby," in the sense that the average (L) 
is within some threshold of the value for the reference 
model itself. 

For well-determined observations, the log likelihood 
has a narrow peak, defining a c? paj , ams -dimensional ellip- 
soid. In a small neighborhood surrounding the reference 



model A*, the mean log likelihood can be expanded in 
series using the Fisher matrix r a ;,: 

(L(A)) ~ (L(A*)) - \{\ - A*) a (A - A*) 6 r a6 (15) 

where for brevity we use (A — . . .) a to denote the coordi- 
nate vector (A Q — ...„) and similarly. In particular, given 
a threshold AL in log likelihood, a coordinate volume 
of order (AL) dparam3 / 2 / y|T| has (median) log likelihood 
within AL of the maximum likelihood point. This coor- 
dinate volume can be very small and scale very favorably 
with AL, given the many parameters d params > 7 com- 
monly modified in binary evolution [fH 1171 124j . 

In practice, however, many detections will be required 
to tightly confine all binary evolution model parameters. 
Rather than an ellipsoid, a threshold AL simply restricts 
to some extended subspace. Nonetheless, we can still 
compute (L) for any model and thus pair of models. If 
we can sample the model space, then for each reference 
model, we can still quantify how many models "look sim- 
ilar": the coordinate volume V{< AL\\*) defined by 



V(< ALIA*) = J 



dX 



L(A)-L(A»)<-AL 



(16) 



For each model this relation defines a one-dimensional 
function V(< AL). For very small AL, the parame- 
ter volume will scale as V cx (AL)^"™""/ 2 as described 
above. At larger likelihood differences, some parameters 
are weakly constrained while others are determined. In 
this regime, we define an "effective dimension" rf e // by 



d eff = 2 



d\nV 
dlnAL 



(17) 



On physical grounds, we anticipate d e // should usually 
decrease monotonically as AL increases. 4 For large like- 
lihood differences, all models are consistent and V con- 
verges to the whole parameter space. 

In practice, the simulation space cannot be exhaus- 
tively explored: not enough computing power is available 
to sample all possible likelihood differences. Nonethe- 
less, for each candidate reference simulation, the func- 
tions V(AL) can be easily estimated by evaluating (L) 
for all other simulations, building a histogram, and fitting 
accordingly. 

To this point we have used a composite discriminant 
((L)) involving both rate and shape. One can also deter- 
mine how many models have a similar event rate alone or 
parameter distribution alone as our reference model. To 
answer the first, let us translate a threshold on AL rate , 



4 The derivative of the volume is mathematically equivalent to the 
(logarithmic derivative of the) density of states. Many examples 
in condensed matter physics show the density of states and by 
implication d e ff can behave unexpectedly in fine-tuned scenar- 
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the contribution to the log likelihood from the event rate 
alone, to an uncertainty in the event number /z: 

AL rate = -D KL (fj,*\Lii) ~ -^*[ ln W^*)] 2 (18) 

A threshold on AL rate corresponds to a relative uncer- 
tainty <fyt//i* ~ y/2AL rate / in the event rate. In other 
words, the event rate provides a single real measurable 
parameter 5 (In /j,) which can be measured to an accuracy 
scaling as \/AL//i» (i.e., the poisson limit oc 1/s/n, if the 
likelihood threshold AL doesn't change with the number 
of events or location on the parameter manifold) . 

By comparison with above and the process of elimi- 
nation, the shape diagnostic Dkl{p*\p) constrains the 
remaining d e ft — 1 parameters. Explicitly, the KL diver- 
gence Dk l {p* \p) between a reference distribution p* and 
perturbed configuration p can be expanded as 



D K l(p* \p) 



J p* lnp^/p 
SX a SX a I d 2 



d\ a d\ b 



lap 



(19) 



in the limit of infinitesimal parameter change <5A = A — A* 
and assuming p is never precisely zero. 6 [The linear- 
order term cancels, as Jp = 1 for all A.] In princi- 
ple, the final expression lets us calculate the contribu- 
tion to the positive-definite Fisher matrix T a b due to 
shape changes Sp in terms of tabulated shape distribu- 
tions p(x\X). In practice, however, numerical estimates 
of p(x\X) are rarely accurate enough to allow accurate 
second derivatives. More critically, the parameter space 
has not been thoroughly enough explored to permit this 
second derivative to be accurately evaluated, except in a 
handful of selected cases. 



C. Characteristic shapes versus tolerance 

Even the most complicated binary progenitor models 
have a finite number of parameters. This number sets 
the number of ways the distribution p could change sig- 
nificantly. For example, in the limit of arbitrarily small 
variations 5X about A*, the change 5p in p can be com- 
puted via the d params independent logarithmic deriva- 
tives 9a In p. These variations prove particularly useful 
at identifying and quantifying the impact of each pa- 
rameter, and at identifying correlations. For example, 
these variations determine how the entropy H(p«) and 
therefore likelihood of the reference model changes as we 



Binary evolution comparisons are particularly simple when event 
rate is used as a coordinate. 

A model where physics completely forbids a particular configu- 
ration can always be distinguished from one that does not, via 
a single observation with those conditions. We will not discuss 
this limit here. 



move A* across the parameter space. Conversely, if we 
fix the reference model A* , these (constrained) variations 
determine the difference in log likelihood between the 
reference model and our candidate point using Eq. (19 1. 



Finally, practically speaking these variations help us in- 
terpret the space of models and determine what measure- 
ments (i.e., what data) best discriminate between them. 
Some measure of the list of possible shape variations 
Ship allowed near a particular reference configuration 
A* therefore proves exceptionally useful. Unfortunately, 
derivatives like d\p and dfp are too difficult to calculate 
reliably when standard Monte Carlo methods for progen- 
itor model simulation are employed. 

We can nonetheless estimate the space of allowed vari- 
ations directly from a large but discrete sample of the 
parameter space near some A. Specifically, given a ref- 
erence model selected from a large collection of random, 
uniformly sampled models, we first eliminate all mod- 
els with expected log likelihood farther than AL (i.e., 
that lie outside U(AL|A*)), assuming data ensembles 
are drawn from the reference model. The subset that 
satisfies this condition will randomly sample the consis- 
tent volume. If the limiting volume is sufficiently small 
as to be ellipsoidal, then the random sampling will en- 
sure more samples along then large eigendirections of F 
and fewer in other directions. We estimate (and rank) 
the plausible variations in the neighborhood of A* via 
a principal component analysis of the N samp distribu- 
tions pa = Pi ■ ■ -PN 3amp where A — 1 . . . N samp indexes 
the sampled points: As described in more detail in the 
appendix, functional principal component analysis corre- 
sponds to the eigenfunction problem for the correlation 
operator [45] : 



C = 



1 



N 



samp j^—i 



^2 \Spa) ($Pj 



(20) 



where Spa(x) = pa(x) — p*{x) and where we adopt a 
quantum-mechanics-motivated notation for brevity. In 
this notation, inner products are evaluated with flat norm 



(a\b) 



dxa* (x)b(x) 



(21) 



The most significant eigenfunctions and eigenvalues of C 
tell us the most significant ways the predicted parameter 
distribution p varies in the sampled domain. In general 
this operation will admit many eigenvectors, potentially 
more than d par ams ■ We expect the number of marginally 
significant eigenfunctions to be particularly large when 
the constrained volume V{AL) is significantly noncllip- 
soidal. 

By construction, this operation will recover d\p as 
eigenfunctions in the limit of a small ellipsoidal volume. 
To prove this, expand each simulation's functional varia- 
tion Spa in series about , using their known coordinate 
locations 5Xa,o,'- 

bp a ^ 5X A , a d\ a p 
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Substituting this approximation into the definition of C, 
we find 

C ~ \d Xa p) (dx b p\ ^2SX A ,aS\A,b 



N 

\d\ a p) (d\„p\ T ab 



(22) 



In the second line, we are replacing the estimate of the 
coordinate correlation function by its expected value, 
known from the shape of the ellipsoidal region. The 
eigenfunctions of C are therefore the eigenvalues of T dot- 
ted into derivatives d\p. 

Generally, the principal components naturally and in- 
variantly characterize the degrees of freedom available to 
us. Critically, this method neither requires coordinates 
nor needs fine sampling: the principal components of a 
coarse grid naturally characterize variation on the coarse 
scale. 



D. Effective dimensions greater than the number 
of model parameters? 

When the model space has indistinguishably small 
scales, the effective dimension can be significantly lower 
than the number of parameters needed to specify the 
model. At the other extreme, the effective dimension 
can also be higher, either by the model manifold "filling 
volume" by folding on itself or by "rotating" through the 
infinitely many degrees of freedom possible in a space 
of distributions. To understand the latter, imagine the 
space of all possible parameter distributions p{x) as a 
high-dimensional vector space; the model space is some 
low-dimensional submanifold. As we illustrate by exam- 



ple in Section IV that submanifold can "rotate" through 
many dimensions, with significant differences in the num- 
ber of local degrees of freedom. On the other hand, the 
model space (like a fractal or ergodic curve) may also 
wind back around near itself, creating the appearance of 
greater complexity and higher dimension. 

In both scenarios, both approaches for calculating lo- 
cal dimension - the effective dimension and a principal 
component analysis - will find model dimensions greater 
than the number of underlying parameters. The number 
of degrees of freedom identified by principal component 
analysis will be larger (possibly much larger). For ex- 
ample, as illustrated in Section [TV] principal component 
analysis treats distinct, nonoverlapping parameter distri- 
butions p{x) as two independent basis vectors: all linear 
superpositions are allowed. 

Whenever the effective dimension is large, observations 
only weakly constrain our model space: many if not 
all model parameters cannot be constrained. Instead, 
a purely phenomenological approach should be adopted, 
based on the range and differences of distributions seen 
in simulations. In special cases, a principal component 
analysis may suggest that some parameters can be con- 
strained but other degrees of freedom should be handled 



phenomenologically. This situation can arise naturally 
when certain parameter predictions are well-sampled and 
tightly connected to one or a few model parameters (e.g., 
the strength of supernova kicks on pulsars; the masses of 
neutron stars) but the remainder are not. We will address 
this scenario concretely in a subsequent publication. 



III. DIAGNOSTICS AND DETECTION 

Knowledge of how much L can fluctuate between data 
realizations helps determine what threshold to set for 
likelihood differences AL for different n. 

For a specific experiment duration and scale, by con- 
struction our diagnostic tells us how well a maximum- 
likelihood statistic could distinguish between any two 
models. A more sensitive or longer experiment will ac- 
cumulate more events, increasing the average maximum 
likelihood, reducing the relative likelihood range AL/L 
that observations do not distinguish between, and there- 
fore shrinking the observatinally-consistent parameter 
volume. How does the discrimination process scale with 
the number of events, on average? How much do random 
fluctuations impact the likelihood difference? How does 
the size of the ambiguity region scale with the number of 
observations? 

To address these questions requires nothing more than 
a model for how the average and standard deviation of L 
change with the expected number of events. Specifically, 
if X = (fJ-,p) and X* are models, with A* the truth, we 
want to know the statistical properties of the likelihood 
difference 



(23) 



As each successive measurement is independent, the 
mean log likelihood difference between any two models 
grows linearly, with a rate that depends on X and A*. 
By contrast, fluctuations in the log likelihood difference 
can grow in complicated ways with the number of events, 
depending sensitively on how distinguishable the two in- 
ternal parameter distributions are. For example, if the 
reference and test models are identical (X = A*) the 
likelihood difference is trivially always zero, with zero 
variance. By contrast, a if the reference and test models 
have dramatically different parameter distributions (e.g., 
p/p* ~ in some region), the likelihood differences will 
generally be large and highly variable, depending on how 
often sample points occur in the region where the two 
models differ. 

Previous studies have described how to use either the 
event rate or some parameter distribution (e.g., the mass 
distribution) to differentate between models. To connect 
to these previous studies, to simplify the calculations, 
and to allow time to reflection on their implications, we 
too explore how well partial information helps distinguish 
between progenitor scenarios. 
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Average versus fluctuating likelihood 1: Event 
rate alone 



Suppose our detector only identifies candidate events. 
[For simplicity we continue to assume the detector oper- 
ates with perfect confidence.] On average, each succes- 
sive independent measurement changes the expected log 
likelihood L of a proposed model X = (fJ,,p) by a fixed 
amount: 



L{X)-L(X.) 



\np{n\i£)/p{n\ii*) 



(24) 



Differences in likelihood between two models therefore 
grow linearly with (average) detected number (i.e., lin- 
early with time or range cubed). As one would ex- 
pect from Eqs. (7pT), to an excellent approximation, 
the average change per event can be ap prox imated by 
D KL (n*\fi)/n* ~ -(ln/i/^*) 2 /2 [see Eq. Q]. 

By contrast and as expected from standard Poisson 
statistics, fluctuations in the log likelihood scale as a 
smaller power {^/J^) of the number of events. To demon- 
strate this result in the notation provided below, the 
standard deviation of 8L(X) = L(X) — L(X*) about its 
mean value can be expressed in terms of certain moments 
of the poisson distribution: 



([ln(p(n|/x)/p(n|/i*))] 2 ) - (ln(p(n|/x)/p(n|/i a 



The second term is precisely Dkl(^*\^) 2 [Eq. (11)]; the 
first term can be evaluated directly: 



number of measurements improve. Since the (average) 
log likelihood difference increases linearly with the num- 
ber of measurements, the ability of measurements to dis- 
tinguish between models is best characterized by the rate 
of increase in distinguishable volume with (average) like- 
lihood: the effective dimension. 

The local dimensionality of a model space is eas- 
ily understood by using In // as a coordinate. Suppose 
g(ln n)d In // is our model density (e.g., our prior probabil- 
ity on In //) . The above criteron shows that by measuring 
N events, model s ca n be distinguished to an accuracy of 
order In /i ~ 1 /\2N . If the prior probability in this small 
neighborhood is nearly constant, the local dimension is 
nearly unity. 

By contrast, observations may support predictions at 
the upper or lower end of the range consistent with the- 
ory, where g(ln/i) is rapidly changing and dropping to 
zero. In this region, observations have far greater dis- 
criminating power. In this case, the effective dimension 
reflects the functional form of the prior in near the sample 
point, averaged over scales ~ 1/vN in ln/i. 

By way of concrete example, if In /x is a linear function 
of a high-dimensional bounded parameter space, then the 
upper limit of In // generally involves a either a "corner" 
or local extremum of In//. As a result, the prior density 
g will have a power-law decay near the cutoff, gd In // cx 
(In /// n m ax) a ~ 1 d hi /•*■ If sufficiently many measurements 
isolate us to a small neighborhood of fi ma x: the effective 
dimension will therefore be 2a. 7 



B. Average versus fluctuating likelihood 2: Spatial 
dependence alone 



([m(p(n|/i)/p(n|//*))] 2 ) = ^p(n|//*)[/t - //* 



(M - M< 
2/i* (/« 



■) 3 H 
- M* 



Mm* -+ 

* ln(/////» 



As a result, we conclude fluctuations in the likelihood 
difference between models are infinitesimal when those 
models make the same predictions, but can be significant 
when they make different predictions 



SL 



//* [in/////*] ~ 2D K l(v*\ij) 



nln(//*///)] 2 ^ ^he °t ner extreme, two models that predict the 
same number // = /** of events on average can be dis- 
lMlnf, , >, languished if they predict distinctly different parameter 
distributions. This seemingly degenerate case is surpris- 
(26ih gly representative: detailed event rate calculations rely 
on highly uncertain inputs, such as the total amount of 
star-forming gas and the fraction of stars forming in bi- 
naries, that impact all predictions proportionally. 

In this limit, as before, each successive independent 
measurement changes the expected log likelihood of a 
proposed model A by a fixed amount, in this case by 



(27) 



Fluctuations in SL are smaller than its mean value when- 
ever the number of detections are sufficiently large, com- 
pared to the relative difference ln/z/zx* between them: 



/ L(A)-L(A,) ' 

\ 



Qnp/p*) =D KL (p*\p) (29) 



1 > 



U SL 

(SL) 



/Jul ln^/M 
D kl (^\ij) 
1 

v/2/x*| In/////* 



y/D KL (n*\(i) 



(28) 



How do measurements improve with more points?: By 
construction, measurements improve in proportion to the 
rate at which the distinguishable volume decreases as the 



As before, the log likelihood difference between the two 



7 The effective dimension in general is twice the derivative of the 
logarithm of the cumulative distribution in a parameter against 
the logarithm of that parameter. The factor of two arises when 
we interpret the cumulative distribution as a volume of a sphere 
and the parameter as a "radius squared" of the constrained vol- 



models fluctuates between different data realizations. 
Taking care to distinguish between all the n independent 
samples involved in the likelihood, we find 



SL 



lnp(x k )/p*(x k ) 



(nlnp/p^) 2 (30) 



The second term is simply plD KL (p t \p) 2 . The first term 
can be reorganized into a sum over n terms quadratic in 
a function of the point x k and a sum over n(n — 1) terms 
where the two points are distinct: 

^[^]lnp(a; fc )/p st (x fe )] 2 ^ = (n(n - 1 )) (Inp/p^f 

+ (n) ([Inp/P*] 2 ) (31) 



£{D KL (p*\p)] 2 +n, ([\np/p*] 2 ) 



(32) 



Fluctuations in SL are smaller than its mean value when- 
ever the number of detections is sufficiently large com- 
pared to a ratio characterizing how different the two pa- 
rameter distributions p,p* are: 8 



1/2 



f > 



U SL 

(SL) 



([lnp/p,] 2 ) 
]MD KL (p*\p) 



(33) 



Both ([lnp/p*] 2 ) and (lnp/p*) = —Dkl(p*\p) char- 
acterize differences in two distributions. In general, the 
two do not agree, as the first therefore is inevitably larger 
than the square of the second: 

([Inp/p,] 2 ) = (lnp/p*) 2 + ([lnp/p* - (Xnp/p*)] 2 ^) 

For example, substituting p, p* both normal distributions 
with zero mean and standard deviation er, er* , we can eas- 
ily show 



<[lnpM] 2 ) = ([lnp/^]) 2 + 



2a± 



Critically, in the limit er — > er*, the second term scales 
linearly as Dkl(p*\p)i not quadratically. We anticipate 
similar behavior in general. Specifically, we can always 
choose q p = —Dkl(p*\p) as one coordinate for the distri- 
bution space . Since both terms go to zero (smoothly) , 
both can be expanded in Taylor series, in integer powers 



1, 2, . . . of q p . As in the event-rate-only case 
Eq. (27)], a leading-order linear term cx Dkl(p*\p) is 



q,p for s 



expected in general. 

As in the rate-only case [Eq. p7|], the linear-order 
term dominatees the ratio on the right side of Eq. ( 33 ) 



in the neighborhood of small Dkl (p* \p) ■ Expanding the 
ratio in scries, we find 



([lnp/p*] 2 ) 



CqDkl 
1 r 



CiD 2 KL 



ciD KL (p*\p) 



(SL) " " V d kl{p*\p) 



(35) 



This formal expression has an intuitive interpretation: 
to distinguish two very similar parameter distributions 
p,p* (i.e., two distributions with Dkl(p*\p) — 0) re- 
quires extracting many sample points from the distribu- 
tion (// ^> I/Dkl)- The more similar the two distribu- 
tions, the more samples are required. 
How do measurements improve with more points?: Like 
the likelihood, the KL divergence D K l(p*\p) has a local 
extremum. Thus, as with the likelihood, if the collection 
of indistinguishable models is a small ellipsoid, the pa- 
rameter volume inside a contour of constant Dkl scales 
as D^ 2 for d the total parameter space dimension. In 
particular, if we choose our contour of constant Dkl to 



satisfy Eq. ( 33 1 , then the parameter volume will scale as 



V oc 1/ M * /2 . 

By contrast, if a wide range of models are indistin- 
guishable, then the distinguishable volume V must scale 
as a smaller power V oc /i* dcff ^ 2 for rf e // < d. 



C. Average versus fluctuating likelihood 3: Models 
with different rates and numbers 

In general, models predict both different numbers and 
distributions of events. As above, on average, each suc- 
cessive independent measurement changes the expected 
log likelihood L of a proposed model X — (p,p) by a 
fixed amount: 



Differences in likelihood between two models therefore 
grow linearly with (average) detected number (i.e., lin- 
early with time or range cubed). 



By contrast, fluctuations in log likelihood do not scale as a simple power of the expected number of events (p*). As 
an example, consider the statistics of a single likelihood, assuming data is drawn from a reference model X* — 

n 

L(X) = lnp{n\»)+^lnp(x k ) (38) 
fc=i 

The standard deviation of SL(X) = L(X) — L(X* ) about its mean value can be directly evaluated. After some algebra, 



9 



we find 

-It - ((SL-(8L)f) = (6L*)-(Ly 

= ((^P{n\^)/P( n \v*)) 2 ) - ( ln P("-lM)/ \np(n\^)) 2 + [2 ((n - fu)hxp(n\fi)/p(n\fi*))] (lnp/p*) + /i* ((lnp/p*) 2 ) 

= ^ [|ln^| 2 + ((lnp/^) 2 )+2(ln A1 / Alst )(lnp/p st )] 

= M * [(ln/i//i* + (InpM)) 2 + ((lnp/p, - (hip/p*)) 2 )] (39) 



where in the final line we regroup terms into a sum of two 
manifestly positive-definite quantities. In the second to 
last expression, the first two terms have been described 
previously and characterize fluctuations in 5L when the 
two models predict different event rates or parameter 
distributions, respectively. The new third term simply 
reflects necessary correlations, implied by the fact that 
more samples will necessarily reveal more shape infor- 
mation on average. Critically, all three terms scale in 
proportion to /i*, when the relative event rates 
are fixed. 

Precisely as before, the condition 1 ~ <rg L / (SL) de- 
fines a surface in model space, allowing us to determine 
how many measurements are needed to distinguish two 
models. Using — ln(/x/^*) and q p = Dkl(p*\p) > as 
local coordinates, this expression identifies a mishshapen 
box about the origin in the q p ,q p half-plane. Evidently, 
in the limit q p <C q p - for models that predict similar 
event rate - the constraint reduces to Eq. (27). Con- 
versely, in the limit q p <C q p - for models that predict 
similar parameter distributions - the constraint reduces 



to Eq. (33). Figure [l] shows an example of the constraint 
surface, for a family p of gaussian distributions with fixed 
variance and arbitrary mean. 

Whether calculated crudely (as a rectangle in q^, q p ) or 
more precisely, the parameter volume inside the contour 
scales as the local dimension. If the constraint contour 
is sufficiently small to be ellipsoidal, then the constraint 
volume must scale as 



V ~ 1/fJL 



d/2 



(40) 



where d is the total number of parameters in the model 
(e.g., all shape parameters plus the event rate parameter 
g„). For weak constraints, the constraint volume will 
scale according to some smaller effective dimension. 



IV. APPLYING CONSTRAINTS TO A TOY 
MODEL FOR BINARY FORMATION 

Up to now, our discussion applies to any data mod- 
cling problem for poisson-distributed events with costly 
theoretical predictions. In this section we will develop 
a concrete example, using distributions, scales, and pa- 
rameters motivated by gravitational wave data analysis. 
To remove any ambiguity or approximation, we choose 
analytically tractable parameters (i.e., one-dimensional 
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FIG. 1: Constraint contour: Contours of the ratio 
gsl/ (5L) versus q p = D K l(j>*\p) an d <7m = hi^/^i, for a 
simple two parameter model, with an arbitrary event rate 
fj, (shown here for p = 100) and gaussian parameter dis- 
tribution p of unit standard deviation and arbitrary mean. 
To a good approximation, the ratio (< 8L > /ash) 2 is 
M* x (?m/2 + q P ) 2 V((<?/j — Ipf + 2g p ). Contours shown cor- 
respond to a I (SL) = l/y/2, 1, \/2, going from outermost in- 
ward. Models with more degrees of freedom have constraint 
contours in a higher-dimensional submanifold, involving more 
parameters than just the simple "shape and rate" diagnostics 



gaussian chirp mass distributions). Using this exam- 
ple, we illustrate the ambiguity region versus likelihood 
threshold, emphasizing how that region changes shape 
and effective dimension. We explain how a small number 
of events constrain only one dimension; how many events 
constrain all dimensions; and, depending on parame- 
ters, an intermediate sample size can constrain differ- 
ent things. We also use of principal component analysis 
to identify and rank relevant mass distribution changes. 
In this case, the effective dimensionality identified with 
these two techniques (volume scaling and principal com- 
ponents) is comparable. 
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A. Model family 

The next generation of gravitational wave detectors 
should soon recover several tens to hundreds of events 
per year [5HS1 ES] ■ Subsequent detector generations with 
cosmologically significant reach should recover hundreds 
of thousands of events per year [47^149] . Given several 
orders of magnitude range, we treat the expected sensi- 
tivity of the detector as a tunable parameter and discuss 
scenarios with 10, 10 2 , 10 4 candidate events. 

Gravitational wave detectors are exceptionally sensi- 
tive to massive binaries formed throughout the unvierse 
O HSj, including exceptionally rare but massive black 
holes formed in low metallicity environments [7H5]. If 
binary black holes are included, theory currently only 
weakly constrains the space of plausible detected mass 
distributions. 9 Observations have far more tightly con- 
strained neutron star masses and the binary neutron star 
mass distribution [37]. Motivated by that sample, for 
simplicity we adopt a single gaussian (chirp) mass distri- 
bution centered on a preferred peak value. 10 For simplic- 
ity and owing to detector limitations, we limit our dis- 
cussion to a single mass measurement per binary. While 
gravitational wave detectors in principle constrain both 
component masses and spins, strong correlations exist be- 
tween most recovered parameters |50j . By contrast, even 
allowing for these correlations, the chirp mass of a neu- 
tron star binary is invariably recovered with an accuracy 
much smaller than the mass distribution's width. 

We therefore examine the following model space, pa- 
rameterized by A = {Ir,x,o~} and a common parameter 



p{x) 



e lR jl* 



(41) 
(42) 



where the common factor ji* controls the overall experi- 
ment sensitivity (e.g., duration and range). The KL di- 
vergence between two one-dimensional gaussians has the 
simple form 



Dkl(p*\p) 



In 



lnp*/p 

a {(x-x*) 2 ) | {(x~x) 2 ) 



<7* 2 2cr 2 



2(T 2 2a 2 
(x - x*) 2 + a 2 



(43) 



Combining this expression and Eq. (10), we can express 



9 Observations of galactic black hole binaries can constrain the 
number and nature of low mass black hole binaries; see Farr 
et al. [18) . 

10 The chirp mass is Ai c = (mim2) 3 / 5 /(mi+rri2) 1/ ' 5 . For a canon- 
ical 1.4Mq + IAMq double neutron star binary, the chirp mass 
is ~ 1.2M . 



the average log likelihood difference between the refer- 
ence model and any proposed model as 



(L(X)-L(X*)) 



-D KL ((t*\fi) - H*D KL {p*\p) (44) 



e R - l R + lna 



2a 2 



where without loss of generality we adopt (x*,cr*) = 
(0,1). Finally, after some algebra, the standard devia- 
tion of the log likelihood about this mean value is 



0~5L 



6x + x 
4ct 2 



3 + 41r(1 + I r ) -4(l + 2j/-lner) In a 

4~ 



2a- 



- [3 + x 2 + 2(1 + x 2 )l R - 2(1 + x 2 ) In 



a- 



As we have seen previously, the interpretation of model 
constraints is simpler in coordinates = ln/x/^i* and 
q p = Dkl (p* \p) that characterize how strongly the event 
rate (g M ) and parameter distribution (q p ) differ from the 
true value. By eliminating x and n, we can explicitly 
re-express both (SL) and asL in terms of q p , q^ and a: 



(L) = ^[q p + e lR + h 



2(l R - q P ) 2 
r-4(J- 2 (l + lncr)] 



(46) 



(47) 



Not all parameter combinations are realizable: q p must 
not only be positive, but it must be greater than 



q P ~ Q(v) — In o 



2d 2 



(48) 



Figures [I] and [3] show contours of the ratio (L) /ctsl 
versus just q^ and q p , for the special case er = 1. The 
contours isolate a small region around the origin. More 
generally, Figure [2] shows contours of the same function, 
allowing all three parameters to vary. In this more gen- 
eral case, as indicated by the transparent contour in this 
figure, the requirement that q p be greater than a function 
of a excludes everything below a parabolic cut through 
the q p , a plane. 



B. Example 1: Unknown median mass and 
moderately known event rate 

By combining this elementary model with different pri- 
ors, we can illustrate the general properties described ear- 
lier in the text. As a highly unrealistic but illustrative 
example, let us assume a high degree of prior information 
about the event rate, illustrated by the gray band in Fig- 
ure [3j and complete knowledge of the mass distribution 
width cr, but no information about the median value x. In 
this case, the first few detections will primarily inform us 
about the previously completely unknown value x. Early 
on, the model space has one effective dimension. With 
only a handful of detections available, we can effectively 



(45) 
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O.tM 

FIG. 2: Constraint contour (3d): For the three-parameter 
detection model described in the text (event rate, average 
mass x, and mass distribution width sigma), contours of 
(5L) /asL versus g M (quantifying how different the event rate 
is from injected), q v (quantifying the difference in shape) and 
a. In these coordinates, q p must be greater than a certain 
function of a, indicated by the transparent contour and de- 
scribed in the text. 



0.05 - 
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Sin fi 

FIG. 3: Why the effective dimension increases: As Fig- 
ure [l] except we overlay a band suggesting some a prior prior, 
here on event rate. 



treat the event rate as known. Graphically in Figure [3j 
the prior region defines a surface much narrower than or 
constraint contours. As those contours shrink with in- 
creasing numbers of events, we gain information only in 
reduced "vertical" extent (via a smaller range allowed for 
Qp) 

Eventually, once the number of detections N is large 
enough to pin down the rate to better than the prior 
range Aln^i (i.e., N > (Aln^)~ 2 ), successive observa- 
tions provide information about both the mass distribu- 
tion and the event rate. Successive observations now fully 
constrain all two dimensions of this model. 

The transition from an effectively one-dimensional to 
two-dimensional problem is easily identified from the 
number of models with different likelihoods, consistent 
with the prior. For large likelihood differences, the num- 
ber of models scales as V oc (AL) 1 / 2 , where by "number" 
I mean the volume relative to (x,/J,) space in uniform 
measure. By contrast, for small likelihood differences, 
the number of models scales as V oc (AL). Finally, in 
the neighborhood of the optimum point, a principal com- 
ponent analysis also identifies the dominant shape (the 
normal distribution p) and shape variation (a function 
oc d x p, corresponding to a shift in local maximum). 

To demonstrate this method unambiguously, we have 
performed a Monte Carlo study, where we explicitly 
generate synthetic gaussian realizations drawn from our 
model, estimate L for a large collection of candidate pa- 
rameters, sort models by log likelihood, and use the ad- 
hoc AL threshold proposed earlier to estimate confidence 
intervals. The results of this toy model study behave pre- 
cisely as expected and described above. 



C. Example 2: Weak three-dimensional prior 

For binary neutron stars, by contrast, the event rate 
is very weakly constrained; by contrast, observations of 
pulsars and X-ray binaries in the Milky Way relatively 
tightly constraint the mass distribution (mean and vari- 
ance) for merging binary neutron stars |371 151) . Taking 
the tight neutron star mass distribution at face value, we 
can use Figure[2]to determine the point at which gravita- 
tional wave measurements will provide new information 
about the mass distribution. For example, assuming the 
(relative) variance a is known to 10% and x to 5%, our 
prior volume in shape parameters is small: q p less than 
10~ 3 and a ~ [0.9,1.1]. In this case, the first gravita- 
tional wave measurements will provide information only 
about the event rate, until at least N ~ 100 events are 
observed; see Figure [l] which is scaled to precisely 100 
events. Of course, this condition could equally well and 
far more easily be derived from an understanding of gaus- 
sian statistics. Our methods, however, generalize to ar- 
bitrary distributions. 

Alternatively, the (chirp) mass distribution for binary 
neutron stars could be substantially wider and biased 
[181 137j : the relevant supernova mechanisms could differ, 
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accretion (slightly) changes each compact object's mass, 
et cetera [9l [24] . More conservatively, one can adopt a 
mass distribution with median known to only 20% and 
variance known to 50%. Again, cither using gaussian 
statistics or Figure[2] one can deduce that far fewer events 
are needed before gravitational wave observations will 
constrain the neutron star mass distribution. 



V. IMPLICATIONS FOR BINARY EVOLUTION 
AND GRAVITATIONAL WAVE ASTRONOMY 

Using the above analysis as a guide, we can anticipate 
how future gravitational wave detection events will in- 
form our understanding of binary evolution. 



Tight constraints and rapidly improving 
performance 



As parameter distributions potentially encode in- 
finitely many degrees of freedom, these distributions can 
completely encode all the details of formation scenar- 
ios. Of course, our ability to directly constrain the event 
rate and distribution (nonparametrically) is highly lim- 
ited by the number of events. Realistic model spaces are 
far smaller, however, allowing us to establish extremely 
tight constraints with a relative handful of events. The 
fraction of models consistent with the data decreases as 
a high power of the number n of measurement events - 
conceivably, as fast as l/n d / 2 (i.e., each parameter is in- 
dependently constrained to an accuracy 1 /\/n) . This ex- 
ponent reflects the (local) complexity of the model space, 
averaged over the set of predictions that cannot be dis- 
tinguished from our data. As observations discriminate 
between finer details, the exponent increases, as the vol- 
ume of indistinguishable models grows smaller. 

Unfortunately, the degree and rate of improvement de- 
pends strongly on what nature provides. To use a simple 
example, assume only the number of events can be mea- 
sured, then compared to some single-peaked distribution 
like the ones provided in O'Shaughnessy et al. [8 . Ob- 
servations that support the most plausible value are least 
informative: most predictions cluster there. Subsequent 
observations improve our certainty in the event rate by 
1/y/n and reduce the fraction of consistent models by the 
same proportion. At the other extreme, if observations 
support an event rate near the limits of what our models 
predict, only a tiny fraction of models can be consistent. 
A range of event rate exist where further observations 
will reduce the fraction of models consistent with obser- 
vations faster than l/s/n. 

To enable the tighest constraints, event rates and mass 
distributions should depend sensitively on several model 
parameters. Previous studies suggest both the number 
of events and their masses depend sensitively on assump- 
tions [51152]. 



B. Rate and shape as two coordinates 



In this paper we propose adopting two specific coordi- 
nates in the neighborhood of any point in a proposed 
compact binary progenitor model parameter space: a 
"log rate" coordinate ln^ and a "change in distribu- 
tion shape" coordinate D KL (p* |p) . We recommend these 
coordinates because they are both calculable and in di- 
rect relation to an intuitively obvious coordinate: the 
expected log likelihood difference between a model and 
the truth [Eqs. (10 1 or (44 f]. The full model space can 
be parameterized with these two and any remaining d— 2 



coordinates. Qualitatively, both Dkl{p*\p) or (L) define 
isocontours that surround the local extremum: they are 
the quantity being extremized and thus more like a "sep- 
aration squared" than a coordinate free to take on any 
value. 

Our coordinates quantify the effort needed to distin- 
guish rate and shape. The KL divergence Dk jl (p* |p) 
quantifies on average how many measurements /i* are 
required to distinguish two similar distributions (/i* ~ 
1 / Dkl (j>* \p) ; see Section III B I . On the other hand, Pois- 
son errors naturally limit how reliably we can measure 
the log of the event rate (5 In fj, cx 1/ -y/Ju; see Section 
III A). Finally, these coordinates let us pheneomenologi- 



cally identify degrees of freedom. 

While in practice we recommend these coordinates 
to phenomenologically identify relevant degrees of free- 
dom, for conceptual purposes we strongly recommend 
one think in terms of some underlying parameters A and, 
as necessary, a Fisher matrix to relate them to expected 
log likelihoods and Dkl [Eq. (15)]. Why? Conceptually, 



the Dkl coordinate is an (expected) log likelihood dif- 
ference between the best-fit model and a candidate. As- 
suming we parameterize our model space with (L) and 
d — 1 other coordiantes, each choice of d — 1 coordinates 
defines some submanifold, which has a point of "clos- 
est approach" (i.e., largest likelihood) to the (unique) 
best-fitting model. In other words, for each combination 
of the non-likelihood coordinates, a maximum value of 
(L) exists, leading to an excluded region in the coordi- 
nate space mirroring the local extremum, as in Figure [2] 
Equivalently, this choice of coordinates has a singular Ja- 
cobian at the local extremum: Dkl — smoothly there 
[Eq. (19)]. For this reason, the coordinate D cannot be 



used when counting dimensions for the volume scaling 
arguments (i.e., V cx /i* d ^ 2 ) unless the Jacobian is taken 
into account and suitably integrated. 

We anticipate models will be easiest to distinguish 
when we make full use of all available information. In 
the language of the previous sections, the KL divergence 
Dkl between the predictions of two sets of model param- 
eters A will increase when as many predictions as possible 
are included in the measurement space x. 
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C. Practical monte carlo to differentiate between 
models 

To this point our analysis has bene purely theoretical. 
We will provide a more concrete analysis in a subsequent 
publication. However, in a previous study [31], we have 
already used this method to quantify how often different 
members of a small model population could be distin- 
guished; see, e.g., their Figure 2. In that study, syn- 
thetic data was generated according to each compact bi- 
nary model. Each synthetic data set was compared with 
all its neighbors; if the likelihood difference was large 
enough compared to the expected magnitude of fluctua- 
tions, they were assumed distinguishable. 

Based on our discussion above, this study's results 
make perfect sense. First and foremost, because differ- 
ent models predicted different numbers of events, mod- 
els that predicted many events had only a handful of 
neighbors, at best, that could never be distinguished from 
them. Moreover, as the predicted number of events in- 
creased, the fraction of distinguishable models decreased, 
as expected (i.e., the fraction should scale as 1/N de ff' 2 
for d e ff some effective dimension). Unfortunately, the 
local error ellipsoid and therefore effective dimension de- 
pend on the reference point. As a result, this study did 
not clearly identify a single trend (i.e., a fraction decreas- 
ing as a single power of the event number): instead, it 
saw the superposition of several trends. This approach 
was also severely limited by the small number of high- 
precision mass distributions used (~ 240). 

In this paper, we have outlined further tools to charac- 
terize a similar discrete sample of simulations: the effec- 
tive dimension and a local principal component analysis. 
Both can be computed from a discrete sample. For ex- 
ample, the discrete cumulative log-likelihood distribution 
P(< SL) found by comparing a synthetic data set to all 
models can be approximated by a power law in the neigh- 
borhood of SL ~ 0. The exponent is the effective dimen- 
sion. Similarly, given a proposed (parameter-dependent) 
threshold on SL, we can always find the set of models 
inside that contour, then perform a principal component 
analysis on the discrete collection of distributions p. This 
decomposition tells us what types of variations to expect 
in that (not always small) neighborhood. We have per- 
formed these calculations for test simulations and will 
provide detailed analysis in a subsequent publication. 

D. Measurement errors 

For simplicity, we have ignored the role of measure- 
ment error. Gravitational wave detectors will be able 
to tightly constrain some parameters, such as the "chirp 
mass" (mim2) 3 / 5 /( TO i + m 2) 1 ^-, as these dramatically 
impact the binary's rate of inspiral. In fact, the infinites- 
imal uncertainty in each chirp mass measurement will be 
far smaller than any expected features in the binary chirp 
mass distribution. 



Beyond this leading order dependence, however, very 
few parameters can be constrained precisely. In direct op- 
position to the chirp mass, the measurement accuracy for 
other parameters - mass ratio, spin magnitudes, and spin 
orientations - is often comparable to or larger than the 
expected width of these features: individual gravitational 
wave measurements provide fairly little information [50) . 
For the least-influential parameters, such as antisymmet- 
ric combinations of spins, successive measurements will 
only constrain our measurement uncertainty, not the un- 
derlying astrophysical distribution. Finally, in several 
cases, systematic limitations in our ability to correctly 
model the long-lived signal from inspiralling, precesing 
compact binaries prevent us from accurately reproduc- 
ing source parameters |53j . 

For these parameters, more delicate comparisons of 
predicted distributions are warranted, that account for 
these measurement errors. A detailed analysis of param- 
eter estimation is far beyond the scope of this paper. 
To leading order, however, we anticipate that param- 
eter estimation uncertainties can be folded directly in 
to the "predicted parameter distribution" p(x). More 
concretely, if Po(x) is the posterior prediction of a par- 
ticular binary evolution model and K{x\x') is the con- 
ditional probability of recovering binary parameters x 
given a measurement x', then we assess the similar- 
ity of the (detector-dependent) predicted distributions 
p(x) = J K(x\x')p (x). As a first approximation, the 
conditional probability distribution can be estimated us- 
ing standard Fisher matrix techniques |50j . 

The fraction of models consistent with observations 
can decrease so rapidly that it presents severe computa- 
tional and data analysis challenges. For example, we need 
to distinguish between distributions differing by AD ~ 
l/n de ^/ 2_1 . To do this correctly, we must evaluate the 
likelihood correctly, which in turn requires us to extract 
parameters for each event; to determine (data-analysis- 
strategy-dependent) selection biases for each type of bi- 
nary; et cetera. When the effective dimension is large, 
very small uncertainties about (or errors in) any stage 
in our data analysis pipeline can easily contaminate the 
distribution comparisons described above. 

In fact, computational challenges also occur even in the 
absence of measurement error. Model parameter distri- 
butions in binary evolution are often sampled by Monte 
Carlo. Because very high precision parameter distribu- 
tion predictions are needed to distinguish neighboring 
models, the number of test binaries one must simulate 
can scale as a prohibitive power of the target likelihood 
accuracy. 



E. Reconstructing distributions from data? 

In this paper we have outlined a simple way to char- 
acterize model distribution differences, motivated by a 
simple maximum-likelihood statistic to differentate be- 
tween proposed models. This approach effectively char- 
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acterizes theoretical differences between predictions but 
would make suboptimal use of real gravitational wave 
data. For example, our method treats each detection at 
a point estimate with perfect confidence, not allowing for 
marginally significant events. 

Another approach to gravitational wave data analysis 
attempts to reconstruct the properties of the detected 
signal distribution, without relying on any underlying 
models at all. This quasi-nonparametric approach has 
been applied both to hard astronomical data astronomi- 
cal problems [18] and the as-yet-hypothetical problem of 
distinguishing binary mass distributions |32) . In this ap- 
proach, one could employ the results of detailed Markov 
Chain Monte Carlo studies (for example) to extract opti- 
mal posterior distributions from each measurement, then 
adjoin them to estimate the overall population. Such an 
approach could include marginal events and make full use 
of the available information per event. 

These two techniques address fundamentally different 
questions, similar to nonparametric and modeled func- 
tion estimation. In general if systematic biases are small 
and controlled, a modeled approach should permit more 
precise constraints. The method described in this pa- 
per naturally selects model parameter distributions with 
greatest support (directly and via principal components) , 
as well as the plausible range of variations. However, 
given the complexity of the model space, nonparametric 
(unmodeled) distribution estimates provide a vital cor- 
roborating test of model-fit results. We will address the 
problem of comparing modeled and unmodeled parame- 
ter estimates in a subsequent paper. 



VI. CONCLUSIONS 

The astrophysical interpretation of gravitational wave 
data faces a common dilemma: observations produce a 
small data set (there, compact object numbers and pa- 
rameters) which need to be compared to expensive, com- 
plex predictions. In this paper, we introduce methods 
to facilitate the identification and interpretation of these 
kinds of comparisons, using gravitational wave chirp mass 
measurements as an example. To better understand the 
model space, we suggest principal component analysis 
on small model subsets, to nonparametrically identify lo- 
cal (highly variable) degrees of freedom that impact the 
mass distribution. We also propose a new family of (lo- 
cal) coordinates on the model space. Rather than choose 
simulation input parameters, these coordinates are nat- 
urally adapted to the physically identifiable degrees of 
freedom, as characterized by the range of predicted pa- 
rameter distributions. Using these coordinates, we can 
transparently address how much information each suc- 
cessive observation provides. 

Previous studies suggest that predicted parameter dis- 
tributions have a range of differences, from dramatic to 
minute. For suitable coordinates, one population param- 
eter Ai may have a dramatic effect on the population 



(e.g., the location of the dominant peak) while others 
may have far less notable impact (e.g., the location of a 
very small feature). Evidently, only a handful of obser- 
vations are needed to identify and measure the first vari- 
ation; many measurements are needed to recognize the 
second variation exists. Our techniques correctly identify 
that these scales exist; determine how many events are 
needed to resolve them; and, for each number of events, 
characterize the relevant number of degrees of freedom. 
In particular, we can predict how well each successive 
event distinguishes a model from its neighbors: the frac- 
tion of consistent models decreases as a power n~ d " ff / 2 
for d e ff an effective dimension characterizing the local 
model space. This fraction can decrease very rapidly if, 
as expected, the predicted mass and spin distributions 
differ significantly between realizations. Past a certain 
point, data analysis pipelines and progenitor model sim- 
ulations must take great care to insure they provide suf- 
ficiently high-accuracy parameter distributions, to best 
exploit the available information. At the other extreme, 
our principal component technique lets us identify when 
the model space is large and a purely phenomenological 
approach is needed. Our methods can incorporate esti- 
mates for experimental uncertainty. In short, we present 
a concrete way to deal simultaneously with experimen- 
tal uncertainty and model complexity: we let the model 
space itself identify what is important. 

Our discussion is broadly applicable to experiments 
where on the one hand experiments measure several prop- 
erties of each event but on the other hand theory cannot 
produce comparable multivariate distributions without 
significant computational cost. Our method is statisti- 
cally well-posed; applies equally well to coarse or fine 
sampling of the model space; can employ as many or few 
dimensions as needed; can be easily modified to account 
for experimental error; and asymptotes to a maximally 
informative Fisher matrix analysis in the limit of small 
statistical and systematic errors. 

In a subsequent publication, we will use concrete bi- 
nary evolution codes like BSE and StarTrack to char- 
acterize the relevant degrees of freedom for binary com- 
pact objects. Based on their concrete example, we will 
assess the computational requirements for future simula- 
tions needed to take full advantage of future gravitational 
wave observations by advanced ground-based detectors 
like advanced LIGO, advanced Virgo, and the Einstein 
Telescope. In another publication we will use pertur- 
bative calculations to assess the limits of this approach, 
showing how third-generation detectors can provide ex- 
tremely high-precision estimates for binary evolution pa- 
rameters. 



Appendix A: Principal component analysis 

The natural eigendircctions associated with a data set 
can be characterized by principal component analysis 
[45] . In one sense, principal component analysis corre- 
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sponds to finding the best set of N orthonormal basis 
functions 4>k (x) with which to approximate a set of N samp 
functions Spa, in that the sample-summed error is small- 
est: 



global error = \\Spa — $Pa\\ x 



(Al) 



leads to an eigenvalue problem for the correlation op- 
erator C [Eq. pOh]. 
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